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ABSTRACT 

Hyperlinks are an essential feature of the World Wide Web. They 
are especially important for online encyclopedias such as Wikipe¬ 
dia: an article can often only be understood in the context of re¬ 
lated articles, and hyperlinks make it easy to explore this context. 
But important links are often missing, and several methods have 
been proposed to alleviate this problem by learning a linking model 
based on the structure of the existing links. 

Here we propose a novel approach to identifying missing links 
in Wikipedia. We build on the fact that the ultimate purpose of Wi¬ 
kipedia links is to aid navigation. Rather than merely suggesting 
new links that are in tune with the structure of existing links, our 
method finds missing links that would immediately enhance Wi¬ 
kipedia’s navigability. We leverage data sets of navigation paths 
collected through a Wikipedia-based human-computation game in 
which users must find a short path from a start to a target article 
by only clicking links encountered along the way. We harness hu¬ 
man navigational traces to identify a set of candidates for missing 
links and then rank these candidates. Experiments show that our 
procedure identifies missing links of high quality. 

Categories and Subject Descriptors 

H. 5.4 [Information Interfaces and Presentation]: Hypertext/Hy- 
permedia— Navigation 

General Terms 

Algorithms, Experimentation, Human Factors 

Keywords 

Navigation; browsing; Wikipedia; Wikispeedia; human computa¬ 
tion; link prediction 

I. INTRODUCTION 

The success of the World Wide Web hinges on the hyperlinks that 
weave its many billions of documents together. It is this fact that 
gave rise to its very name. Hyperlinks are essential for several rea¬ 
sons. From a human-computer interface perspective, they allow 
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(a) Target prediction 


(b) Source prediction 


Figure 1: In scenario (a) a source s is given, and the aim is to 
find mentions of relevant concepts in the source page and link 
them to appropriate targets. Here the set of candidate anchors 
is limited to the source document. In scenario (h) a target ar¬ 
ticle t is given, and the aim is to identify sources that contain 
relevant mentions of t and could benefit from linking to it. Here 
every document mentioning the target is a potential source. 


users to explore the available information in a natural way by ef¬ 
fortlessly following pointers to references. For instance, the act of 
clicking a Wikipedia link is negligible compared to the cumber¬ 
some effort of flipping through countless pages of a paper ency¬ 
clopedia, or even finding another volume on yet another shelf in 
a physical library. Links are also important from an information 
management perspective. They are among the most prominent fea¬ 
tures used by search engines, both for indexing and ranking. When 
building a Web index, anchor texts serve as infoimative descriptors 
of the target page they point to, oftentimes more so than page titles 
and content (ID. Moreover, in search-result ranking, several stan¬ 
dard methods, such as PageRank 0 and HITS (TS), rely on the 
Web graph induced by the hyperlinks. And finally, links are impor¬ 
tant from a content provider perspective, since they make content 
discoverable to users and search engines. If a document has no in¬ 
coming links, it cannot be accessed by a browsing user, nor can it 
be crawled and indexed by a search engine. 

Source vs. target prediction. While maintaining a good hyperlink 
structure is important, it is also difficult and time-consuming. We 
distinguish two tasks that typically arise during webpage mainte¬ 
nance, which we term target prediction and source prediction: 

• In the target prediction task (Fig. [Da)), a source document s 
is given, and the goal is to find mentions of relevant concepts 
in s and link them to appropriate targets t. 

• In the source prediction task (Fig. [Db)), a target document 
t is given, and the goal is to identify sources v that contain 
relevant mentions of t and would benefit from referencing t. 



















To illustrate these two abstract tasks, consider the following con¬ 
crete scenarios (schematized in Fig.[^. 

Target prediction: In a typical target prediction scenario, a trav¬ 
eler might write a blog post about his recent trip to Tuscany, in 
which he mentions several places, foods, landmarks, and historical 
persons. To provide more context, he wants to link these mentions 
to external resources, such as Wikipedia articles, online recipes, 
or hotel websites. Manually identifying the relevant concepts and 
linking them to the most appropriate target pages can be a tedious 
process |11||28| . 

Source prediction: A typical source prediction scenario might 
involve a software engineer in a large company who has just fin¬ 
ished a piece of code that could increase the productivity of many 
colleagues. She also created a documentation and tutorial page, but 
for it to be visible, she needs to link to it from other pages that 
colleagues interested in her code are likely to visit, such as com¬ 
pany-internal wikis, Q&A fora, etc. Along the same lines, consider 
a Wikipedia editor who has just written a new article. The article 
is of little use if it is not reachable from other articles, so the editor 
wants to plant links into other articles to point to the new article. 

Identifying appropriate sources for a given target is even more 
difficult than identifying appropriate targets for a given source: In 
target prediction, the set of candidate anchors is limited and can 
be identified by inspecting the source document. In source pre¬ 
diction, on the contrary, the set of candidate sources is practically 
unbounded, as any page on the Web is a potential candidate to link 
to the target. 

Existing approaches. Automatic methods for detecting missing 
links would be important and useful. Previous work has proposed 
methods mainly for the target prediction problem; i.e., they an¬ 
notate a given source document with links to external resources, 
primarily Wikipedia articles. One class of techniques can process 
arbitrary plain-text documents. Here the text of the input docu¬ 
ment is combined with background knowledge from Wikipedia’s 
textual content as well as graph structure to predict outgoing links 
|22[|23[[26) . A second group of approaches takes Wikipedia arti¬ 
cles as input and uses the already existing links (and possibly the 
text content) to predict further outgoing links for the input article, 
e.g., based on adjacency-matrix factorization |44||45| , information 
retrieval techniques |13||51| , or machine learning |29|. 

With regard to the source prediction problem, a very simple ap¬ 
proach would be to first collect all anchor texts in the document 
collection that frequently link to the target t and to then link all as 
yet unlinked occurrences of these anchor texts to t as well. Unfor¬ 
tunately, this method is too simplistic and suffers from some major 
drawbacks: first, a phrase might not be link-worthy in every con¬ 
text (e.g., ‘flower’ is a good link anchor in the context of botany, 
but not in that of a wedding, where the concept of a flower needs no 
further explanation); second, the disambiguation problem is not ad¬ 
dressed (e.g., ‘Florence’ should link to FIRENZE in most contexts, 
but to FLORENCE, ALABAMA in the context of Lauderdale County, 
Alabama). Hence more sophisticated algorithms are required for 
solving the source prediction task. 

Independent of whether the source or target prediction task is 
considered, what is largely missing from the picture is the realiza¬ 
tion that, beyond text content and graph structure, there are addi¬ 
tional sources of data that could be utilized in order to detect miss¬ 
ing links more accurately. In particular, one such source of data 
that has remained mostly unexplored is human navigational traces 
on websites. Such traces are captured by the usage logs recorded on 
the server side by many websites, and the question arises: How can 
human click trails be harnessed in order to detect missing links? 



Figure 2: The final portions of several navigation paths with 
the same target t = inflammation. The unfilled nodes are 
Wikipedia articles that appeared on paths to t. The number in 
each node indicates the percentage of paths with target t that 
passed through that node. 


Present work: Navigation logs for mining missing links. In this 
paper we explore the use of human navigational traces for detecting 
missing links in websites like Wikipedia. Logs of website naviga¬ 
tion contain strong signals with regard to which existent hyperlinks 
are useful: from a user interface perspective, if a link is traversed 
often by humans then it is useful, and if it is never traversed by 
humans then it is redundant (although from a search engine’s per¬ 
spective it could still be useful for indexing and ranking). Hence 
we expect such logs to also contain clues into whether an as yet 
non-existent link should be there or not. For instance, if we often 
observe users going through page ^ and ending up in page f, al¬ 
though s does not directly link to t, then it might be a good idea to 
introduce a ‘shortcut’ link from i to t. 

As an analogy, consider the task of improving a road network. A 
civil engineer would not just look at the existing road segments and 
try to infer which road segments to build next. Rather, she would 
take into account how heavily each road segment is used and would 
then decide where it would make sense to add a shortcut, an extra 
lane, or a traffic light. We argue that similarly we should consider 
how the Web’s hyperlink structure is used and decide on that basis 
what hyperlinks to add next. The raison d’etre of hyperlinks is to 
enable navigation, so by creating hyperlinks that aid navigation, we 
are optimizing the right objective. 

Proposed approach to source prediction. Here we propose a 
method for using navigational data to discover missing links fol¬ 
lowing the above intuitions, thereby addressing the source predic¬ 
tion problem; i.e., given a target page, we find good sources to 
link to the target. We demonstrate the effectiveness of our ap¬ 
proach using Wikipedia. We chose Wikipedia as our proof-of-con- 
cept domain because high-quality navigation logs are available for 
it, collected via a class of human-computation games known un¬ 
der names such as The Wiki Game or Wikispeedia |39||43| . In 
these games, users are given two Wikipedia articles—a start and 
a target —and need to find a short path from the start to the target 
by traversing links encountered in the visited articles. The under¬ 
lying graph structure is unknown to users; they only see the out¬ 
going links of the page they are currently on. But, crucially, they 
also have expectations on which pages should link to which other 
pages, based on their commonsense and expert knowledge about 
the world, and are guided by these expectations toward articles they 
consider likely to contain links to the target. 








We consider our approach to be general and applicable to web¬ 
sites other than Wikipedia. Also note that we simply use naviga¬ 
tional traces from the Wikipedia games since they are readily avail¬ 
able to us. Obtaining raw, passively collected browsing logs of 
Wikipedia is much harder due to privacy considerations. However, 
we are encouraged to believe that our approach will generalize to 
passively collected browsing logs as well. 

Building on the above intuition about humans browsing the Web, 
we reason as follows: if page s is traversed by many users in search 
of target f, then this is an indicator that users expect the link from j 
to t to exist. So if i does not link to t yet (or not any more, for that 
matter), but contains a phrase that could be used as an anchor for t, 
then we should consider the link (s,t) for addition. 

As a concrete example, consider Fig. The figure summarizes 
several navigational paths, all with the target t = INFLAMMATION. 
Paths progress from bottom to top, and only the last few clicks are 
shown per path. Each node s also contains the fraction of all paths 
with target INFLAMMATION that passed through j. For instance, 
we see that 17% of times INFLAMMATION was reached from IN¬ 
FECTION and 13% of times it was reached from ALLERGIC RE¬ 
SPONSE. A considerable fraction of paths (15%) passed through 
ACUTE (MEDICINE), which does not link to f, although it mentions 
t several times and could clearly benefit from a link to it. 

The central part of our approach is that we mine many link candi¬ 
dates (s,t) from a large number of navigation traces for each target 
t and then rank these candidates by relevance. 

We perform a set of experiments using automatically (and thus 
only approximately) defined ground-truth missing links, as well as 
an evaluation involving human raters. In our automatically de¬ 
fined ground truth, we consider as positive examples of missing 
links such links that existed for a substantial amount of time but 
are missing from the latest Wikipedia snapshot. In our evaluation 
by humans, raters labeled the identified missing links as relevant 
or not. Experiments show that restricting the candidate set to pairs 
observed in paths and then ranking those candidates using a simple 
heuristic performs better than applying more sophisticated ranking 
methods to the set of all possible candidates (i.e., including those 
not observed in paths). The reason why simple ranking methods 
suffice is that the ‘heavy lifting’ is done by the users before rank¬ 
ing, by using vast amounts of world knowledge to select the pages 
that are best-suited to link to the target. 

The remainder of the paper is structured as follows. Sec. IHp ro- 
vides details about the navigation trace data set and our methods 
for candidate selection and ranking. We present experimental re¬ 
sults in Sec. 1^ and provide further discussion and perspectives in 
Sec.|^ Related work is reviewed in Sec.[^ and Sec. [^concludes 
the paper. 

2. USING HUMAN NAVIGATION UOGS 
FOR MINING MISSING UINKS 

In this paper we address the source prediction task of Eig. m- 
Given a target page t, we suggest a good list of source pages i- that 
should link to t. Although this paper is focused on the special case 
of Wikipedia, we believe that our approach is general enough to be 
applied to other websites beyond Wikipedia. We discuss implica¬ 
tions and requirements of this extension in Sec.|^ 

We selected Wikipedia as our domain because, first, it consti¬ 
tutes an important and relevant special case due to its reliance on 
the links between articles, and because, second, high-quality nav¬ 
igation traces are available for it. In this section we first describe 
how these data were collected and then give a detailed account of 
how we use them for source prediction. 


2.1 Data sources 

Our data sets of Wikipedia navigation traces were collected via a 
popular online game that is generically known as ‘Wikiracing’ |50| . 
Several websites offer versions of this game, such as The Wiki 
Game or Wikispeedia |39||43| , but they all share the same gen¬ 
eral idea: a user is given two Wikipedia articles—a start and a 
target —and is asked to navigate from the start to the target by ex¬ 
clusively clicking hyperlinks contained in the visited pages. We 
also refer to start-target pairs as missions. In our experiments we 
use data from both The Wiki Game (Sec. ED and Wikispeedia 
(Sec. |3.2| l. Before describing how exactly we do so, we provide 
more details about the two data sets. 

The Wiki Game. In The Wiki Game, users may choose from five 
challenges: ‘least clicks’ (minimize the number of clicks), ‘speed 
race’ (minimize time), ‘five clicks [or fewer] to Jesus’ (find JESUS 
in five or fewer clicks, minimizing time), ‘no United States’ (mini¬ 
mize time while avoiding the USA article), and ‘six degrees of Wi¬ 
kipedia’ (minimize time while finding a path of exactly six clicks). 
In each challenge, several players compete for the same mission 
simultaneously, navigating the full Wikipedia. We pool the paths 
collected from all five challenges between 2009 and 2012, thus ob¬ 
taining a data set of 974k paths grouped into 364k distinct missions 
(start-target pairs); i.e., there are 2.7 paths per mission on average. 
The number of distinct targets is 3,219, i.e., we have 303 paths per 
target on average, with a median of 208. Targets with many paths 
are quite frequent; e.g., there are 2,087 targets with at least 100, and 
698 targets with at least 500, paths. In the following, we focus on 
the 2,087 targets (65% of all targets) that have at least 100 paths. 

Wikispeedia. Unlike The Wiki Game, Wikispeedia is a single¬ 
player game. Once a mission is successfully completed, the user 
may enter her name into a high-score table associated with that 
mission, where users are ranked by number of clicks, with ties bro¬ 
ken by time. The game is played on a reduced, static snapshot 
of Wikipedia containing 4,604 of the most important articles (48| . 
The data set we work with is publicly available |40| and comprises 
51k paths collected from 2009 to 2014, grouped into 29k distinct 
missions, for an average of 1.8 paths per mission. The number of 
distinct targets is 3,326; i.e., we have 15 paths per target on average, 
with a median of 10. 

Comparing the two data sets, we notice that The Wiki Game, on 
the one hand, has the advantage of more data, in particular more 
paths per target. Wikispeedia, on the other hand, has the advan¬ 
tage of using a static Wikipedia snapshot (while The Wiki Game 
fetches pages from live Wikipedia on the fly and caches them for 
some time [By), which allows for a different kind of evaluation 
(Sec. |3.2^ , and of being publicly available, which makes our exper¬ 
iments reproducible. 

2.2 Source candidate selection 

Our method for source prediction consists of four steps, as illus¬ 
trated in Eig.[^ (The fourth step, ranking, is covered in Sec. |2.3| ) 

Step 1: Collect paths. We start by collecting navigation paths with 
the given target f, up to time T. 

Step 2: Generate pairs. For each path p = {po,-.. ,Pn = t), the 
initial set of candidates is {{pi,t) : 0 < i < «}, i.e., every direct link 
from any page along the path to the target t is initially a candidate. 
(The start page po is exempt, since it is selected randomly and is 
therefore unlikely to be a useful candidate.) There are in general 
many paths for the same target t (upper left box in Fig. [^, so we 
take the union of the candidate sets resulting from all these paths 
(lower left box) as the initial candidate set for t. 


Paths with target t 



Figure 3: Overview of our approach for mining missing hyperlinks to a given target t from human navigation traces. (1) Collect 
paths with target t up to time T, and captnre the reference Wikipedia snapshot Wj at time T. (2) Generate source-target pairs. 
(3) Filter the pairs based on Wt- a pair (j.f) becomes a candidate if i mentions, but does not link to, t in Wt\ also exclude (i,r) if j 
tends to appear in the second half of paths with target t. (4) Rank the candidate links. 


Step 3: Filter. Next, we filter this initial set using the Wikipedia 
version Wj at time T, which serves as our reference snapshot. A 
link {s,t), where s e {pi,...can be suggested only if it 
does not already exist in Wj. Further, the source s should contain 
a phrase that could serve as the anchor for a link to t; in other words, 
^ should mention t in Wj. 

To detect pages that mention the target f, we construct the set At 
of all phrases that serve as anchor texts for t across all articles in 
the reference Wikipedia snapsho0and subsequently define that j 
mentions t if it contains any phrase from At . 

Previous work |42| has shown that navigation traces tend to con¬ 
sist of a ‘getting-away-from-the-start’ phase, in which the user at¬ 
tempts to reach a hub article that is well-connected in the network 
of pages, and a ‘homing-in-on-the-target’ phase, in which the user 
actively seeks out pages related to t. Guided by this finding, we 
apply one additional filtering step and include in our final candi¬ 
date set (the box labeled ‘candidate links’ in Fig.|^ only sources 
that tend to appear in the second half of paths with target t. More 
precisely, we first define the relative path position of p; along the 
path p = (po, pn = t) to be i/n, and then discard the pair {s,t) 
if the relative path position of s on paths with target t is less than or 
equal to 0.5 on average. 

2.3 Source candidate ranking 

Source candidate selection yields an unordered set of candidates for 
each target t. The goal of the next (and final) step in our pipeline is 
to turn this set into a meaningful ranking (step 4 in Fig.[^. Since 
the source prediction task (Fig. [T|b)) asks for sources for a given 
target f, we produce a separate ranking for each t. Several ranking 
methods are conceivable: 


Since we deal with Wikipedia as our data set, we choose re¬ 
latedness measures based on Wikipedia (see below). 

2. Ranking by path freqnency. Navigation traces provide us 
with statistics about how frequently a source i was traversed 
by users searching for target t. Based on this, we compute 
the path frequency of i given target r, defined as the fraction 
of paths that passed through s, out of all the paths with target 
t. Intuitively, pages s that were traversed more frequently on 
paths to t should be better sources for links to t. 


We experiment with two relatedness measures for case 1 above. 
The first is due to Milne and Witten |25| and is based on the inlink 
sets S and T of s and r, respectively. It calculates the distance 
between i and t as the negative log probability of seeing a link from 
5 n T when randomly sampling a link from the larger one of the 
sets S and T (normalized to approximately lie between 0 and 1), 
and the relatedness as one minus that distance: 


log(max{|5|,|r|})-log(|5nr|) 

log(A)-log(min{|5|,|r|}) 


( 1 ) 


where N is the total number of Wikipedia articles. 

The second relatedness measure is due to West et al. |44| and 
works by finding a low-rank approximation of Wikipedia’s adja¬ 
cency matrix via the singular-value decomposition (SVD). The pair 
(s,t) corresponds to an entry in the adjacency matrix A and 
to an entry in the rank-A: approximation A/, obtained from 

A via SVD. If A[.s,?] = 0 and Aj.[j,f] ^ 0 then i does not link to t 
yet but is a good candidate. Therefore we define the SVD-based 
relatedness as 


1. Ranking by relatedness. It seems reasonable to rank source 
candidates s by their relatedness to f, since clearly a link 
is more relevant between articles with topical connections^ 

*In practice, we exclude (1) phrases that rarely (less than 6.5% of 
all cases |26| ) serve as link anchors for any target, which excludes, 
e.g., ‘A’ as an anchor for AMPERE, and (2) anchor texts for which 
t is seldom (less than 1% of all cases) the target, which excludes, 
e.g., ‘Florence’ as an anchor for FLORENCE, ALABAMA. 
^According to the Wikipedia linking guidelines (49) , links should 
correspond to ‘relevant connections to the subject ot mother article 
that will help readers understand the article more fully.’ 


SVD(j,r) =A,fc[i,r]-A[i,f]. (2) 

In our experiments on The Wiki Game, we use the reduced rank k = 
1,000. Since the adjacency matrix is much smaller for Wikispeedia 
(Sec. |2.1| l, we use the smaller value oik = 256 there. 

2.4 Exploratory analysis of link candidates 

Having introduced the data set and our source prediction method, 
we now explore the data set of human navigation traces to build 
intuitions on strengths and potential weaknesses of our approach. 

We use the Wikipedia version as of T = 2014-01-02 as our ref¬ 
erence snapshot Wt in all experiments. 














Number of pages on a path mentioning the target. We count for 
each path p = {po,... ,p„ =t) how often the target t is mentioned 
across all visited nodes (excluding the randomly se¬ 

lected start page pg) and find that, on average, t is mentioned on 1.7 
pages per path. Since p„_i contains a link to t, it is very likely to 
also mention t (for our definition of a mention, cf. Sec. |2.2[ l, which 
means that, on average, each path contains 0.7 additional pages that 
mention t. 

Now consider the subset of visited pages that mention t. Out of 
these, 73% contain a link to t in the reference Wikipedia snapshot 
Wy-. The remaining 27%, which do not link to t in Wy, are po¬ 
tentially good candidate sources to link to t, since these pages were 
actively chosen by the user while searching for t. 


Properties of pages along paths. We also investigate which parts 
of a path carry most value for source prediction. Consider Fig. |4(a)[ 
which aggregates all paths and shows for each part of the path how 
likely the pages in that part are to mention t. In order to be able to 
aggregate paths of variable length, we adopt the notion of relative 
path position (Sec. 2.2 i. Fig. 4(a) uniformly buckets the range [0,1] 
into five intervals and plots the average for each interval. We only 
include paths of at least five clicks, such that each path contributes 
to each bucket, and the page p„^ i just before the target always falls 
into the last bucket. 

We see that target mentions become more frequent as paths prog¬ 
ress (the black curve in Fig. |4(a)| l: two-thirds of pages with relative 
path positions in the interval [0.6,0.8) mention the target, while at 
positions in [0.8,1.0) nearly all pages (91%) do. We are particu¬ 
larly interested in mentions that are not accompanied by a link to t 
(the magenta curve in Fig. |4(a)^ , since these are our source candi¬ 
dates. The figure tells us that candidates are more likely to appear 
towards the end of paths: at relative path positions in the interval 
[0.6,0.8), 30% of pages without a link to t mention f, and for the 
interval [0.8,1.0), the fraction is as high as 46%. 

We note that these curves are in tune with previous work |42| , 
which has shown that humans tend to follow a ‘semantic gradi¬ 
ent’ during information network navigation, passing through arti¬ 
cles that get ever more related to the target. In this light it makes a 
lot of sense that the rate of target mentions should increase as paths 
progress. 


2.5 Obtaining ground truth based on Wikipe¬ 
dia evolution 

In order to form intuitions about how meaningful our suggestions 
are, we would ideally like to evaluate for each relative path po¬ 
sition how good the source candidates at that position are. How¬ 
ever, ground-truth data is hard to come by; in order to make strong 
claims, we need to ask humans how good our predictions are. We 
do so later on (Sec. |3.1.2^ , but since obtaining human ratings is 
expensive and time-consuming, we preliminarily adopt a notion of 
ground truth that is approximate and biased, but nevertheless al¬ 
lows us to gain some initial insights. In this subsection, we define 
this approximate ground truth and analyze our navigation traces in 
terms of it. 

We obtain a weak notion of ground truth from the evolution of 
Wikipedia’s graph structure as follows. First we define the link 
rate of (j,t) as the fraction of time s contained a link to t since 
i- was created. (We compute these values based on Wikipedia’s 
complete edit history.) Then we choose a link-rate threshold a e 
[0%,..., 100%] and label a candidate link [s,t) as positive if its link 
rate is greater than a. Candidates with a positive label correspond 
to links that existed for a substantial amount of time, but got deleted 
before the reference Wikipedia snapshot Wy {cf. step 3 of Sec.|2.2^. 
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Figure 4: (a) Fraction of sources s mentioning target t, as a 

function of the relative path position of s\ the magenta curve is 
conditioned on j- not linking to t. (h) Fraction of candidates that 
are positive according to the automatically generated ground 
truth, as a function of relative path position, for several link- 
rate thresholds a (Sec. |2.5^ ; source candidates mention, but do 
not link to, the target and are considered positive if their link 
rate is greater than a. 


That is, such links could have been valuable for navigation, yet 
were removed at some point in time, so we argue that reintroducing 
them is likely to improve Wikipedia. 

Consider a candidate {s,t) labeled as positive according to the 
above definition. The link (s,t) may (case 1), or may not (case 2), 
have existed during the game from which it was mined. Further, if 
it existed during the game, it may (case la), or may not (case lb), 
have been clicked by the user. These three cases correspond to the 
following scenarios. If the link {s,t) existed during the game and 
was clicked by the user (case la), but has been deleted since (as 
required for (i, t) to be a candidate), then it is probably a good idea 
to suggest it for reintroduction. If the link existed during the game, 
but was not clicked by the user (case lb), this means that she did 
not see it in her rush to reach t as fast as possible (or else she would 
have clicked on it to immediately win the game); so either the user 
found another promising way to continue the search before seeing 
the link to t, or the link was too hard to find in the text of i, which 
is a signal that we should reintroduce that link and make it more 
obvious. Finally, if {s,t) did not exist during the game (case 2) then 
the user could not possibly have taken it, although she might have 
intended to do so (since she actively navigated to s while searching 
for t); in this case, too, (i,f) might be a good link suggestion. 

We refer to our automatically obtained labels as ‘weak’ because, 
by definition, they contain many false negatives. Wikipedia is an 
evolving organism, and an important part of our task is to suggest 
links which never existed. However, by the above link-rate thresh¬ 
old criterion, these links will be counted as negative examples. For 
an example of such a false negative, consider again Fig. where 
the article on ACUTE (MEDICINE) should clearly link to INFLAM¬ 
MATION, as it explains a concept critical to understanding the term 
ACUTE as used in medicine, but the link from ACUTE (MEDICINE) 
to INFLAMMATION is labeled as negative by the automatic ground 
truth, since ACUTE (MEDICINE) has never linked to INFLAMMA¬ 
TION in Wikipedia’s history. In other words, the automatically 
obtained ground truth has high precision, but low recall of truly 
positive examples. Nonetheless, this weak ground truth is useful 
during development because it provides us with many labeled ex¬ 
amples for free and allows for relative comparisons between differ¬ 
ent methods. 

Fig. |4(b)| captures this approximate notion of candidate quality, 
again broken up by relative path position. The graph shows that 
the fraction of positives obtained from the automatically obtained 
ground truth becomes higher for pages appearing later on in paths. 











We conclude that not only are mentions at later positions more fre¬ 
quent (Fig. |4(a)^ , but that they also correspond to better link an¬ 
chors. (We try several values for the link-rate threshold a, but the 
same trend holds for all thresholds.) This provides additional justi¬ 
fication for our decision to include in our set of source candidates 
only sources that tend to appear in the second half of navigation 
traces with the given target (Sec. |2.2[ l. 

3. EVALUATION 

In our experiments we compare five methods: Given a target t, we 
can either consider as source candidates the set of all articles that 
mention t but do not link to it (across our entire reference Wikipe¬ 
dia snapshot W/-); or we can subselect candidate sources based on 
whether we observe them in navigation paths (Sec. |2.2|. F urther, 
we consider two relatedness measures for ranking (Sec. |2.3^ . This 
yields four combinations of candidate selection methods (‘none’ 
and ‘path-based’) and relatedness measures (‘MW’ and ‘SVD’). 
The fifth method requires no external relatedness measure but sim¬ 
ply ranks candidates with respect to their frequency among paths 
with target t (Sec. |2.3[ l. 

To sum up, we consider the following five methods for predicting 
missing links to a given target page f. 

• No selection, rank by MW: Use all candidate sources and 
rank them based on the MW method (Eq.[TJ. 

• No selection, rank by SVD: Use all candidate sources and 
rank them based on the SVD method (Eq.[^. 

• Path-based selection, rank by MW: Only use candidates 
appearing in navigational traces and rank them based on the 
MW method. 

• Path-based selection, rank by SVD: Only use candidates 
appearing in navigational traces and rank them based on the 
SVD method. 

• Path-based selection, rank by frequency: Only use can¬ 
didates appearing in navigational traces and rank them based 
on the frequency with which they appear in paths {cf. Sec. |2.3| 
for our definition of path frequency). 

Ground truth. We perform a twofold evaluation, one based on the 
automatically obtained and approximate labels defined in Sec. |2.5| 
(we use the link-rate threshold a = 30% throughout), the other 
based on labels obtained from human raters. For our human evalu¬ 
ation, we select a subset of targets, predict sources for them using 
the methods that performed best during the development phase on 
the automatic ground truth, and ask raters on Amazon Mechanical 
Turk to label the top predictions. Here we get rid of the short¬ 
comings of the automatic ground truth, on which we cannot obtain 
absolute performance numbers (mainly due to the high false-nega¬ 
tive rate; Sec. |2.5[ l, but have less data to work with. 

We perform an evaluation by humans only on the predictions 
obtained on data from The Wiki Game. 

Evaluation metric. As our evaluation metric, we use precision@k 
for k= We first calculate the K precision values for each 

target separately and then compute the aggregate value for each k 
by averaging over all targets. This means we can only include tar¬ 
gets for which our methods find at least K source candidates (which 
naturally shrinks the set of test targets). We use A' = 10 for The 
Wiki Game, which defines our evaluation set of 699 targets. Since 
the Wikispeedia data set contains fewer paths, we are less restrictive 
here and choose K = 5, obtaining an evaluation set of 181 targets. 


Candidate 

Rank by 

Rank by 

Rank by 

selection 

MW 

SVD 

path freq. 

None 

39% 

20% 

N/A 

Path-based 

43% 

32% 

39% 


Table 1: Area under the precision®^ curve for no candidate 
selection versus path-based candidate selection for all ranking 
measures (The Wiki Game; Fig.|5(^. Note that path frequency 
is only applicable for path-based candidate selection. 


The bulk of our experiments is performed in Sec. ED on data 
from The Wiki Game. Subsequently, Sec. |3.2| completes the eval¬ 
uation by demonstrating that our algorithm works equally well on 
Wikispeedia. 

3.1 Evaluation on The Wiki Game 

We start by evaluating our algorithm on data from The Wiki Game, 
first based on the automatically obtained ground truth, then by ask¬ 
ing human raters. 

3.1.1 Evaluation using automatically obtained 
ground truth 

The precision curves for all five methods as evaluated on The 
Wiki Game are displayed in Fig. |5(a)[ and their performance is 
summarized in terms of the area under the precision curve in 
Tabled 

Overall, we achieve good performance, especially given that our 
ground truth is of high precision but low recall, with many false 
negatives. Even though the precision®^ lies in the range between 
0.4 and 0.5 (for path-based candidate selection and MW ranking), 
manual error inspection revealed that most suggested links make 
sense and are truly missing, and that, in fact, the Wikipedia com¬ 
munity has simply never included these links into the Wikipedia 
graph so that they could have made their way into our ground truth 
{e.g., (acute (medicine), inflammation) in Fig. I^. 

Comparing the different methods, we observe that path-based 
candidate selection performs better than doing no subselection for 
both relatedness measures used in ranking. Path-based selection 
improves performance by a particularly large margin for the SVD- 
based ranking method, which has much lower precision ®k than 
the other methods. This establishes the fact that there is a lot of 
value in path-based candidate selection especially when the ranking 
measure does not excel by itself. 

The margin between path-based selection and no selection is 
larger for smaller k, which means that considering navigational 
paths is particularly useful for predicting the top link sources. 

Note that both relatedness measures (MW and SVD) use the 
high-quality link structure of the Wikipedia page graph (Sec. |2.3^ . 
If we wanted to generalize our approach to domains beyond Wiki¬ 
pedia, we can easily imagine scenarios where no such high-quality 
relatedness measures are readily available {e.g., when pages are not 
as topically coherent as Wikipedia articles, or pages have scarce 
content and are poorly interlinked). With such situations in mind, it 
is encouraging to see that our fifth measure (‘rank by frequency’ in 
combination with path-based candidate selection; the yellow curve 
in Fig. |5(a)[ l performs quite competitively. Recall that that rank¬ 
ing method does not rely on any external relatedness measure but 
simply ranks source candidates with respect to the frequency with 
which they appeared on paths with target t. This is an important 
observation because it means our method has the potential to gen¬ 
eralize well to use cases where a good relatedness measure is not 
readily available. 
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Figure 5: Performance in terms of precision®^ for different sonrce selection and ranking methods on onr two data sets. Bold 
lines represent path-based candidate selection, (a) The Wiki Game, antomatically obtained ground truth (Sec. |3.1.1) . (b) The Wiki 
Game, human ground truth (Sec.|3.1.2^; only the MW ranking method was used in the human evaluation. Note that performance as 
evaluated by humans exceeds the estimate from the automated evaluation (Fig. |5(a)k i.e., the latter underestimates the actual quality 
of suggested links, (c) Wikispeedia, automatically obtained ground truth, same evaluation methodology as applied to The Wiki Game 
(‘standard evaluation’; Sec.|3.2^. (d) Wikispeedia, automatically obtained ground truth, stricter evaluation (Sec.|3.2^. 


Reintroduction of valuable but deleted links. By construction, 
the last click on a path always leads into the target. The fact that a 
user looked for, found, and clicked on this link is a very strong sig¬ 
nal that the link is useful for navigation. Removing such links from 
Wikipedia is particularly harmful from a user-interface perspective, 
and it is desirable that a source prediction method suggest them for 
reintroduction. To see if our path-based candidate selection method 
meets this desideratum, Fig. |6(a)| plots, for each rank k, the fraction 
of predicted links that were also the last link on the paths they were 
mined from. We observe that, while most suggested links were not 
clicked by humans (most likely because they were not present in 
the version of the Wikipedia page used by The Wiki Game), a sub¬ 
stantial fraction (between 20% and 35%) correspond to links that 
existed at game time and were chosen by the user but do not exist 
in the reference snapshot any more. We conclude that our top sug¬ 
gestions are often links that were taken by the user as the last click 
to the target but have since been removed, and thus our method 
rightfully reintroduces such links back into Wikipedia. 

Total volume of added links. So far we have conducted a per- 
target evaluation, by first computing precision @k values for each 
target and then averaging over all targets for each rank k. But it is 
also interesting to consider the total number of links we can sug¬ 
gest at a given precision level, across all targets, since this gives us 
an idea of the potential number of improvements we could make 
to Wikipedia by deploying our system. The results of this eval¬ 
uation are presented in Fig. |6(b)| which shows that we can make 
1,000 link suggestions at a precision of 42%, and 10k suggestions 
at a precision of 30% (ranking candidates by frequency and assum¬ 
ing the same link-rate threshold a = 30% used in our automated 
evaluation, corresponding to the red curve in Fig. |6(b)[ (. 

3.1.2 Evaluation by human raters 

Wikipedia is a continuously evolving entity. Although the link his¬ 
tory, on the basis of which we defined our automatic ground truth, 
captures this evolution, it can only tell us which links are positive 
examples (because they persisted throughout a long period of time). 
However, there are many links that should be, but have never been, 
added to Wikipedia, and if our method suggests such a link, then 
the previous evaluation would count it as a bad suggestion. There¬ 
fore the above notion of ground truth suffers from false negatives. 
To combat this problem, we perform a more accurate evaluation 
by human raters in this section. Having done so, we can also con¬ 
firm the prevalence of false negatives post hoc (see the end of this 
subsection). 



Rank k Number of top link suggestions 

(a) Fraction of final clicks (b) Volume vs. precision 


Figure 6: (a) Fraction of clicks that were the final click along 
their respective path, as a function of rank. If a suggestion cor¬ 
responds to the final click along a path, the suggested link must 
have existed at game time, so a useful link is effectively rein¬ 
troduced into Wikipedia by that suggestion, (b) Precision as a 
function of the total number of top link suggestions made by 
our method across all targets, where suggestions are ranked by 
frequency (Sec. |2.3t . Positive suggestions are those whose link 
rate lies above the respective link-rate threshold a (Sec.|2.5^. 


Methods compared via human evaluation. In our human eval¬ 
uation, we compare two of the top-performing methods: (1) path- 
based candidate selection with MW ranking and (2) no candidate 
selection with MW ranking. By using the same ranking method and 
only switching whether path-based candidate selection was per¬ 
formed, we can gauge the impact of the latter on performance. 

Target sampling. In order to select targets on which to evaluate 
the predictions of the two methods, we stratify the base set of 699 
targets by the number of paths observed per target and select ten 
targets from each decile, for a total of 100 test targets. The ratio¬ 
nale behind stratification is that we want to avoid being biased to¬ 
wards targets for which the path-based candidate selection method 
can make a particularly large number of good predictions (because 
more data are available for those targets). 

Obtaining ratings through Amazon Mechanical T\irk. We use 

Amazon Mechanical Turk Q for recruiting human raters. As in the 
automatic evaluation, our goal is to assess the precision @k, where 
k = 1,..., 10, for the two compared methods. In each rating task, 
the human evaluator was presented with a target t and a set of 14 
candidate sources and was asked to indicate which of the candidates 
should contain a link to the target article. There were no constraints 























Candidate 

Automatic 

Human-labeled 

selection 

ground truth 

ground truth 

None 

39% 

38% 

Path-based 

43% 

55% 


Table 2: Area under the precision®^ curve for MW ranking, 
comparing the automated (Fig. |5(a)^ and human (Fig. |5(b)^ 
evaluations of our method run on data from The Wiki Game. 

on the number of source articles the rater could choose. The set of 
14 candidate sources comprised the following entries: 

1. Five predictions from each of the two compared methods 
(either suggestions 1 through 5 or suggestions 6 through 10 
from each method). 

2. Two control sources, sampled randomly from the set of all 
Wikipedia articles that link to the target t. 

3. Two control non-sources, sampled randomly from the set of 
all Wikipedia articles and hence highly unlikely to link to t. 

In cases where the two methods agreed on a suggestion, that 
suggestion was included only once in the set of source candidates, 
thereby making the presented list shorter than the maximum of 14 
items. Also, to prevent any ordering bias, we shuffled the order 
of sources in the presented list. The task description is reproduced 
verbatim in Appendix [A| 

We paid 50 per task, and each task was presented to ten different 
workers. We consider a source to be a positive example if over half 
of the ten raters labeled it as such. 

Fig. |5(b)| presents the results. We observe that path-based can¬ 
didate selection followed by MW ranking outperforms MW rank¬ 
ing on the set of all candidates by a large margin. Table which 
summarizes the performance of both compared methods on the hu¬ 
man-labeled ground truth (again as the area under the precision 
curve) and compares it to the performance obtained on the au¬ 
tomatically labeled ground truth, shows that the area under the 
precision curves for path-based candidate selection increases by 
12%, compared to the automatically obtained ground truth. On the 
other hand, when doing no candidate selection, the area under the 
curve decreases by 1%. 

The reasons for the increased performance on the human-labeled 
ground truth are twofold. First, the automatically obtained ground 
truth uses only a historical notion of correctness in which many ac¬ 
tually positive examples are mislabeled as negative. Second, MW 
relatedness alone, without performing path-based candidate selec¬ 
tion, might not capture the notion of human-intuition-based simi¬ 
larity well. Path-based selection, on the contrary, captures exactly 
that quality by design, and it is thus not surprising that it prevails 
on a human-labeled ground truth by such a large margin. 

Out of the controls that represent randomly selected sources al¬ 
ready linking to the target page (item 2 in the above list of source- 
candidate types presented to raters), only 9% are labeled as positive 
by more than five of the ten raters, a value much lower than even 
our precision® 10 of about 50%. This tells us that the links we 
suggest are better than the average pre-existing link to the target. 

Finally, out of the random control non-sources (item 3 in the 
above list), only one pair (GEOGRAPHY OF KOREA to SOUTH KO¬ 
REA) was rated positive by more than five of the ten raters (we 
happened to sample a connected pair here). This statistic confirms 
that human labeling was not random. 

False negatives in the antomated ground truth. Now that we 
have human-labeled data, we can quantify the prevalence of false 
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Figure 7: Histogram of average human labels for examples la¬ 
beled as negative by the automatically obtained ground truth, 
highlighting the prevalence of false negatives in the latter. 


negatives in the automatically constructed ground truth. For this 
purpose, consider Fig.|^ which shows a histogram of the average 
human labels for the candidates that were labeled as negative ac¬ 
cording to the automatic ground truth. Here, ‘average human label’ 
refers to the average of the binary labels obtained from the ten hu¬ 
man raters for each candidate. We see that a large fraction of the 
examples labeled as negative according to the automated ground 
truth are in fact positive examples according to the more reliable 
human ground truth. 

3.2 Evaluation on Wikispeedia 

To conclude the evaluation, we present results on the second data 
set of navigation paths, collected via Wikispeedia. Recall from 
Sec. |2.1| that, while The Wiki Game has the advantage of more 
data, it also has a slight drawback: it does not use a static Wikipe¬ 
dia snapshot but rather fetches articles from live Wikipedia on the 
fly and caches them for some time (10| , which means that we do not 
know the exact version of the article the user saw at game time; the 
versions used in different games may be different from each other 
and from the reference snapshot Wj-. Hence, our evaluation on The 
Wiki Game could not account for what a source article s looked like 
at game time. Instead, we allowed for suggestion all links {s,t) not 
present in the reference snapshot Wj-, regardless of whether they 
existed during the respective game, and our automated evaluation 
counted a suggestion {s,t) as positive if the link was present for 
a substantial fraction of the entire lifetime of i. We call this the 
standard evaluation. 

Wikispeedia, on the contrary, uses a static Wikipedia snapshot 
W |48| , so we know exactly which links existed during the game. 
In the notation of Sec. |2.2| and Fig.|^ the snapshot Wt- is replaced 
by W, which is identical to the snapshot used in all games. This in 
turn allows for a stricter evaluation methodology: By allowing for 
suggestion only those links that were not present in W, we permit 
only links that did not exist during the game and that the user could 
thus not possibly have clicked. Further, we count a suggestion as 
positive only if it has been present in the live Wikipedia for a sub¬ 
stantial amount of time after the date of the static snapshot W. If a 
suggested link did not exist during the game, but was added after¬ 
wards, this is an even stronger signal that the suggestion is good. 
Hence we call this the stricter evaluation^ 


^When applying the standard evaluation to Wikispeedia, we use the 
same reference snapshot Wt also used for The Wiki Game in order 

































Now, if we can show that the stricter evaluation yields similar 
results to the standard evaluation on Wikispeedia, then we may ar¬ 
gue by analogy that the stricter evaluation would likely give similar 
results on The Wiki Game, too, if such an evaluation were possible 
on that data set. 

The results are displayed in Fig. |5(c)| and Fig .|5(d)[ As for The 
Wiki Game, we use the link-rate threshold (Sec. |2.5| ~Q = 30% for 
deciding if a suggestion is positive, and as described in the begin¬ 
ning of Sec. we consider targets for which our algorithm can 
make at least K = 5 suggestions and show the average precision @fc 
for k= 1,... ,5. 

Our first observation is that the standard-evaluation results are 
similar for Wikispeedia (Fig. |5(c)] ( and The Wiki Game (Fig. |5(a)) . 
In particular, the orderings of methods by performance are identi¬ 
cal. (The results are somewhat less clean for Wikispeedia, due to 
the smaller amount of data.) Further, the outcome of the stricter 
evaluation (Fig. |5(d)[ l is similar to that of the standard evaluation 
(Fig. |5(c)| l, the main difference being that SVD ranking performs 
better under the stricter evaluation. 

We therefore have reason to believe that the performance would 
also remain high on The Wiki Game under the stricter evaluation 
if this kind of evaluation were possible on that data set, which cor¬ 
roborates the result that our algorithm finds good new links. 

4. DISCUSSION 

This paper introduces an effective method for the source predic¬ 
tion problem (Fig.[T|b)), in which a target page 1 is given, and the 
task is to find and rank sources s that should link to t. Prior work 
(Sec.[^ has primarily addressed the complementary target predic¬ 
tion problem (Fig. [TJa)), where i- is given and t to be found. We 
consider source prediction more challenging than target prediction, 
since in the latter the set of link candidates is immediately given 
by the phrases contained in the source page j, whereas, in the for¬ 
mer, every page could potentially be a source, so the set of source 
candidates must first be retrieved in a candidate selection step. 

Computational feasibility. To illustrate this point, we briefly re¬ 
port on an experiment we had initially planned on doing. We in¬ 
tended to compare the performance of our method to the link pre¬ 
dictions made by Milne and Witten’s |26| machine-learned target 
prediction algorithm, but this was computationally infeasible: In 
order to use this target prediction method in a source prediction 
setting, we first had to find all articles s mentioning t (this required 
a full scan of a 44GB Wikipedia dump). Next, we intended to an¬ 
notate each source s with outgoing links and then rank s according 
to the score it gives to t. However, each annotation takes on the 
order of several seconds (2Z)> and nearly every article mentions at 
least one of the targets we want to evaluate, so we would have had 
to annotate essentially all of Wikipedia, which would have taken 
several million seconds, or several thousands of hours. One reason 
for the computational complexity of Milne and Witten’s algorithm 
is that they (as well as other target prediction methods |22||23|[5T| ) 
tend to spend significant effort on mention disambiguation. 

On the contrary, in our approach we neither have to scan Wiki¬ 
pedia for articles that mention f, nor do we need to do any sophis¬ 
ticated disambiguation or ranking. We simply use as source candi¬ 
dates all pages seen in our navigation traces, look for mentions only 
in this small subset of all Wikipedia pages, and rank according to 
a simple precomputed metric or simple frequency counts. This is 
possible because the brunt of the computational effort is done by 
humans: since they actively seek out pages that are likely to link to 

to decide whether a source mentions, or links to, a target. Under the 
stricter evaluation, the static snapshot W is used for this purpose. 


the target, these pages tend to already be good source candidates, 
and issues such as disambiguation are much less critical. 

Applications beyond Wikipedia. Now we address the question if 
and how our technique could apply beyond the realm of Wikipedia. 
We envision two ways forward. 

The first idea would be to gamify arbitrary websites. One could 
imagine a framework, e.g., written in JavaScript, that would wrap 
the website of interest, recruit players, and ask them to navigate 
to the targets we are interested in linking to. This would require 
adding at least some initial links pointing to t manually, such that 
t is reachable by navigating. Furthermore, our method for finding 
valid anchors for the target, which is currently based on anchor- 
text/target-page pairs mined from Wikipedia (Sec. |2.2| l, would need 
to be adapted to the new domain. Possibilities would include the 
use of prevalent phrases from the target’s title and content as anchor 
texts, or, akin to our current method, the use of anchor texts that are 
already being used in other pages to refer to the target. 

The second approach we envision is to use passively rather than 
actively collected log data for source candidate selection and rank¬ 
ing. It might be possible to simply use the logs that are kept by 
webservers anyway. The added challenge here would be that we do 
not know what target (if any) a user tried to reach, whereas the tar¬ 
get is always given explicitly to the user in the human-computation 
setup. However, we believe that reasoning along the following lines 
might be promising: if users that ended up in t often went through 
s, then the shortcut from i to t might be promising. An alternative 
heuristic might be to collect instances where a user navigates to s, 
issues a keyword query into the website’s search box (if it exists), 
and clicks to t from the search-engine result page. 

On Wikipedia, the linking guidelines are explicitly stated (49| , 
so links are fairly consistent. Further, each page is typically about 
a single, well-defined topic. These are among the reasons why ma¬ 
chine-learning methods can infer powerful models for linking to 
Wikipedia articles. Websites other than Wikipedia are less likely to 
have the above properties, so it will be more difficult for statistical 
models to predict meaningful links. We expect methods for mining 
missing links directly from navigational traces to suffer less from 
this problem, since they do not take the detour through modeling 
the static structure of the link graph, but instead directly optimize 
navigability as the objective. 

What we find especially promising in this light is a result from 
Fig. 0 namely that our method does not crucially rely on any mea¬ 
sure of relatedness between pages: ranking our source candidates 
simply by the frequency with which they occurred in navigational 
traces for the given target (the yellow curves of Fig. constitutes 
a competitive method. We believe that this makes our approach 
a strong candidate for the source prediction task on websites other 
than Wikipedia, where a notion of relatedness between pages might 
be much harder to obtain. 

5. RELATED WORK 

We find missing links by observing humans navigating a network 
during a human-computation game. There has been related work 
on several aspects of our approach: the link prediction problem, 
human network-navigation behavior, and games with a purpose. 

Link prediction. The link prediction problem in networks comes 
in many flavors and variants. Unsupervised methods for link pre¬ 
diction in social networks were extensively evaluated by Liben- 
Nowell and Kleinberg |19| , who found the Adamic-Adar mea¬ 
sure Q to perform best. More recently approaches based on net¬ 
work community detection and random walks Q were 

considered for predicting missing links. Supervised link predic- 




tion | |20| was also studied by the relational-learning community |31[ 
|34| , but scalability remains a challenge with these approaches. 

While the above works focused mostly on the identification of 
missing links in social networks, there is also a rich line of work on 
the identification of missing links among Wikipedia articles 13 
2^|44|[5T) and on linking existing webpages to Wikipedia (22[ 23 
2^ . Generally these approaches focus on building models of Wiki¬ 
pedia’s graph structure, while also performing keyword extraction 
and word-sense disambiguation. 

Human navigation in networks. This line of research is rooted 
in Milgram’s seminal small-world experiment ]24) , which asked 
participants to forward a letter to a friend such that it could finally 
reach a predestined target person. The game through which our 
data was collected is similar to this task in that a target must be 
reached in the absence of prior information about the underlying 
network. It is different in that our setup has the same user stay¬ 
ing in control from the start all the way to the target, whereas, in 
the small-world experiment, every step is executed by a new, au¬ 
tonomous participant. Kleinberg GZ) investigated the algorithmic 
aspects of the small-world problem, showing that efficient search is 
only possible when the probability of long-range links decays as a 
power law with a specific exponent. 

Much research has followed in Kleinberg’s wake, so we focus on 
the most directly related projects: data sets such as ours were pre¬ 
viously analyzed by West and Leskovec, who characterize human 
strategies in successful navigation tasks (42| and train machine¬ 
learning models capable of navigating automatically ED- and by 
Helic et al. 0 and Trattner et al. |36| , who explore heuristic navi¬ 
gation algorithms based on hierarchical knowledge representations. 

A related line of work pertains to the analysis of so-called ‘click 
trails’. Research here primarily studies the click paths on which 
users embark starting from search-engine result pages. In early fun¬ 
damental work, Chi et al. coin the notion of ‘information scent’, 
operationalized by Olston and Chi |30| in a system for supporting 
users by combining query- and click-based navigation strategies. 
White and Huang (46| establish that click trails add value to the 
information contained in the ultimate target page, and Teevan et 
al. |35| show that users frequently prefer click-based navigation 
to querying. Downey et al. ED investigate the benefits of navi¬ 
gating versus querying further, finding that navigating is particu¬ 
larly useful when the information need is rare. Work by White and 
Singla ED is relevant in that it explores how different trail topolo¬ 
gies (such as stars, trees, and linear chains) are observed in dif¬ 
ferent search scenarios (informational versus navigational). Click 
trails have also been used to predict whether users will give up in 
information network navigation |32| , to compute the semantic re¬ 
latedness of concepts (33[ |43| , and to identify relevant websites 
from user activity Q. Our work continues this line of work and 
attempts to use navigational trails as a rich source of data for de¬ 
tecting missing links in networks. 

Games with a purpose. ‘Games with a purpose’ (M) were popu¬ 
larized by von Ahn and colleagues, a seminal early example being 
the ESP Game for labeling images ED- Wikispeedia was origi¬ 
nally designed as a game with a purpose for computing the seman¬ 
tic relatedness between concepts (43) . Further relevant work was 
done by Ageev et al. (^, who developed a human-computation 
game for collecting data in which users are asked to find the an¬ 
swers to as many factual questions (e.g., ‘What is the highest peak 
in the Western Hemisphere?’) as possible within a given amount of 
time, using Web-search queries that may optionally be followed by 
click-based navigation. As in our navigation data sets, the goal is 


explicitly known here, but not in the form of a specific target page 
but rather in the form of a specific answer string. 

6. CONCLUSION 

In this paper we studied the problem of identifying missing links 
in Wikipedia. We built on the fact that the ultimate purpose of Wi¬ 
kipedia links is to aid navigation. Our method harnesses human 
navigation traces and finds missing links that would immediately 
enhance Wikipedia’s navigability. We analyze click trails to iden¬ 
tify a set of candidates for missing links and then rank these candi¬ 
dates. We experimented on both automatically labeled ground truth 
as well as ground truth obtained from human annotators. Overall, 
we obtained performance of the quality that would make our sys¬ 
tem useful in practice. 

There are many interesting avenues for future work. For in¬ 
stance, extending the method to passively collected Web-browsing 
logs would be a natural next step, and it would also be worthwhile 
to think about gamifying general websites beyond Wikipedia. 

In summary, our paper makes contributions to the rich line of 
work on detecting missing links on websites. We hope that future 
work will draw on our insights to build more user-friendly websites 
and make the Web more navigable as a whole. 
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APPENDIX 

A. HUMAN-RATER INSTRUCTIONS ON 
AMAZON MECHANICAL TURK 

The following description of the evaluation task was given to hu¬ 
man raters on Amazon Mechanical Turk: 

“Here’s the deal! Our good friend Wikipedia is having self¬ 
doubts and wants you to help improve its links. 

You are given a Wikipedia article (referred to as the target) and a 
list of other Wikipedia articles (referred to as source articles). You 
have to tell Wikipedia if the source article should contain a link to 
the target. And of course, if you are unsure of what the source or 
target article means, you can always click on the article name to 
open it in a new tab. 

But remember that Wikipedia is a sensitive fellow and will be 
mad if you don't play by the rules: There should be a link from the 
source to the target if and only if( I) the target article has some rel¬ 
evant information about the source article and could help readers 
understand the source more fully, or (2) the target article describes 
a proper name which is likely to be unfamiliar to readers." 


